Dimension Reduction for Taxonomy Data – Main Results
This document presents the results of applying UMAP, a dimension reduction (also called “ordination”) method, to the microbiome data. An overview of all evaluated alternative methods, which lead to results similar to those of UMAP, can be found in a separate document.
How to read it:
UMAP performs a (dis)similarity analysis on relative abundances. In the estimation process, every sample is compared to every other sample in the data based on the similarity of their bacterial species profiles. The more similar two samples are, the closer together their points are plotted in the graphs.
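To make the idea of comparing species profiles concrete, here is a minimal base-R sketch on made-up counts. It assumes the Bray-Curtis dissimilarity as the measure, which is a common choice for relative abundances but is not confirmed by this document; the matrix `counts` and the helper `bray_curtis()` are hypothetical and not part of the analysis code.

```r
# Toy count matrix: rows are samples, columns are species (made-up numbers).
counts <- matrix(
  c(10,  0, 30,
    12,  1, 27,
     0, 40,  0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("s1", "s2", "s3"), c("spA", "spB", "spC"))
)

# Relative abundances: divide every row by its total so each row sums to 1.
rel_abund <- counts / rowSums(counts)

# Bray-Curtis dissimilarity between two profiles (assumed measure, see above).
# 0 means identical profiles, 1 means no shared species.
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Compare every sample with every other sample.
n <- nrow(rel_abund)
dissim <- matrix(0, n, n,
                 dimnames = list(rownames(rel_abund), rownames(rel_abund)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    dissim[i, j] <- bray_curtis(rel_abund[i, ], rel_abund[j, ])
  }
}

print(round(dissim, 3))
```

In this toy example, s1 and s2 have similar profiles and therefore a small dissimilarity, while s3 shares almost nothing with them. A dimension reduction method would accordingly place s1 and s2 close together and s3 far away.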
Note: Because UMAP is a nonlinear method, there is no straightforward measure of how much information is lost by focusing on two dimensions only. For the PCoA methods, the first two dimensions explain only 15% of the overall variation in the data, i.e. 85% of the information is lost when focusing on these two dimensions alone. Accordingly, results should be interpreted with care.
Because of this, minor differences in the plotted ellipses should not be overinterpreted. Instead, interpretation should focus on more global patterns, such as neighboring families / genera or the age and richness patterns visualized for the UMAP results.
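For readers wondering where a figure like the 15% for PCoA comes from, the following sketch shows one common way to derive it from the eigenvalues of a classical PCoA. It runs on simulated data, uses base R's `cmdscale()` and plain Euclidean distances as stand-ins, and is not the code that produced the number reported above.

```r
set.seed(1)

# Hypothetical data: 20 samples x 15 taxa, converted to relative abundances.
x <- matrix(rexp(20 * 15), nrow = 20)
rel_abund <- x / rowSums(x)

# Euclidean distances stand in for whatever dissimilarity the analysis uses.
d <- dist(rel_abund)

# Classical PCoA; eig = TRUE additionally returns the eigenvalues.
pcoa <- cmdscale(d, k = 2, eig = TRUE)

# Share of variation per axis, based on the positive eigenvalues only.
eig_pos <- pcoa$eig[pcoa$eig > 0]
explained <- eig_pos / sum(eig_pos)

# Variation captured by the first two axes, and the part that is not shown.
explained_2d <- sum(explained[1:2])
lost_2d <- 1 - explained_2d

print(round(c(explained = explained_2d, lost = lost_2d), 3))
```

UMAP has no analogous eigenvalue decomposition, which is why no comparable percentage is reported for it.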
Data preparation and UMAP estimation
library(sauerkrautTaxonomyBuddy)
library(SummarizedExperiment) # microbiome analysis
library(mia)                  # microbiome analysis
library(vegan)                # (dis)similarity measures
library(scater)               # dimension reduction and visualizations
library(dplyr)                # data handling
library(tidyr)                # data transformation
library(ggplot2)              # data visualization
library(ggpubr)               # joint ggplots
library(patchwork)            # joint ggplots
library(kableExtra)           # table printing

# set ggplot2 theme
theme_set(
  theme_minimal() +
    theme(plot.title       = element_text(hjust = 0.5),
          plot.subtitle    = element_text(hjust = 0.5),
          panel.grid.minor = element_blank(),
          plot.background  = element_rect(fill = "white", color = "white"))
)
# add logarithmized versions of the SCFA variables to the dataset, for plotting
dat <- dat %>%
  mutate(stool_methylbutyricAcid_log = log(stool_methylbutyricAcid),
         stool_aceticAcid_log        = log(stool_aceticAcid),
         stool_butyricAcid_log       = log(stool_butyricAcid),
         stool_hexanoicAcid_log      = log(stool_hexanoicAcid),
         stool_isobutyricAcid_log    = log(stool_isobutyricAcid),
         stool_isovalericAcid_log    = log(stool_isovalericAcid),
         stool_propionicAcid_log     = log(stool_propionicAcid),
         stool_valericAcid_log       = log(stool_valericAcid))
colData(tse) <- dat %>% DataFrame()
In the following ‘flatulence after fresh / pasteurized intervention’ and ‘better digestion after fresh / pasteurized intervention’ plots, all measurements of a person are plotted, and all timepoints of that person are colored according to the respective variable (e.g. whether the person had flatulence after the fresh intervention).
# flatulence after interventions
gg_famList$covar_flatulence_afterFresh + gg_famList$covar_flatulence_afterPast
# better digestion after interventions
gg_famList$covar_betterDigestion_afterFresh + gg_famList$covar_betterDigestion_afterPast
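The coloring rule used in these plots (one person-level answer applied to all timepoints of that person) can be illustrated with a small base-R sketch. All data frames, column names and values below are hypothetical and only mimic the structure described above; this is not how `gg_famList` is actually built.

```r
# Hypothetical sample-level data: one row per person and timepoint,
# with made-up 2D coordinates standing in for the UMAP embedding.
samples <- data.frame(
  person_id = c("p1", "p1", "p2", "p2"),
  timepoint = c("T1", "T2", "T1", "T2"),
  umap1     = c(0.1, 0.3, -1.2, -0.9),
  umap2     = c(1.0, 0.8,  0.2,  0.4)
)

# Hypothetical person-level variable: one answer per person.
person_info <- data.frame(
  person_id             = c("p1", "p2"),
  flatulence_afterFresh = c("yes", "no")
)

# Joining on the person ID repeats the person-level answer for every
# timepoint, so all points of one person receive the same color group.
plot_dat <- merge(samples, person_info, by = "person_id")

print(plot_dat)
```

In a ggplot2 call, the joined column could then be mapped to the color aesthetic, so that a person's points share one color regardless of the timepoint at which the sample was taken.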